{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "64595523",
   "metadata": {},
   "source": [
    "# Faster Prediction with TensorRT\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/advanced_topics/tensorrt.ipynb)\n",
    "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/advanced_topics/tensorrt.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f900433",
   "metadata": {},
   "source": [
    "[TensorRT](https://developer.nvidia.com/tensorrt), built on the NVIDIA CUDA® parallel programming model, enables us to optimize inference by leveraging libraries, development tools, and technologies in NVIDIA AI, autonomous machines, high-performance computing, and graphics. AutoGluon-MultiModal is now integrated with TensorRT via `predictor.optimize_for_inference()` interface. This tutorial demonstates how to leverage TensorRT in boosting inference speed, which would be helpful in increasing efficiency at deployment environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00112177",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import time\n",
    "import warnings\n",
    "from IPython.display import clear_output\n",
    "warnings.filterwarnings('ignore')\n",
    "np.random.seed(123)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fc9bd87",
   "metadata": {},
   "source": [
    "### Install required packages\n",
    "Since the tensorrt/onnx/onnxruntime-gpu packages are currently optional dependencies of autogluon.multimodal, we need to ensure these packages are correctly installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14ca0d8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    import tensorrt, onnx, onnxruntime\n",
    "    print(f\"tensorrt=={tensorrt.__version__}, onnx=={onnx.__version__}, onnxruntime=={onnxruntime.__version__}\")\n",
    "except ImportError:\n",
    "    !pip install autogluon.multimodal[tests]\n",
    "    !pip install -U \"tensorrt>=10.0.0b0,<11.0\"\n",
    "    clear_output()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48bcdc4f",
   "metadata": {},
   "source": [
    "## Dataset\n",
    "\n",
    "For demonstration, we use a simplified and subsampled version of [PetFinder dataset](https://www.kaggle.com/c/petfinder-adoption-prediction). The task is to predict the animals' adoption rates based on their adoption profile information. In this simplified version, the adoption speed is grouped into two categories: 0 (slow) and 1 (fast).\n",
    "\n",
    "To get started, let's download and prepare the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d6e63f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "download_dir = './ag_automm_tutorial'\n",
    "zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_for_tutorial.zip'\n",
    "from autogluon.core.utils.loaders import load_zip\n",
    "load_zip.unzip(zip_file, unzip_dir=download_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1e7ea6e",
   "metadata": {},
   "source": [
    "Next, we will load the CSV files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4dab1073",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "dataset_path = download_dir + '/petfinder_for_tutorial'\n",
    "train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)\n",
    "test_data = pd.read_csv(f'{dataset_path}/test.csv', index_col=0)\n",
    "label_col = 'AdoptionSpeed'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa15cd70",
   "metadata": {},
   "source": [
    "We need to expand the image paths to load them in training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d3cdcb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "image_col = 'Images'\n",
    "train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0]) # Use the first image for a quick tutorial\n",
    "test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])\n",
    "\n",
    "def path_expander(path, base_folder):\n",
    "    path_l = path.split(';')\n",
    "    return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])\n",
    "\n",
    "train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))\n",
    "test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "184f5f90",
   "metadata": {},
   "source": [
    "Each animal's adoption profile includes pictures, a text description, and various tabular features such as age, breed, name, color, and more.\n",
    "\n",
    "## Training\n",
    "Now let's fit the predictor with the training data. Here we set a tight time budget for a quick demo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb2e35ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogluon.multimodal import MultiModalPredictor\n",
    "hyperparameters = {\n",
    "    \"optim.max_epochs\": 2,\n",
    "    \"model.names\": [\"numerical_mlp\", \"categorical_mlp\", \"timm_image\", \"hf_text\", \"fusion_mlp\"],\n",
    "    \"model.timm_image.checkpoint_name\": \"mobilenetv3_small_100\",\n",
    "    \"model.hf_text.checkpoint_name\": \"google/electra-small-discriminator\",\n",
    "    \n",
    "}\n",
    "predictor = MultiModalPredictor(label=label_col).fit(\n",
    "    train_data=train_data,\n",
    "    hyperparameters=hyperparameters,\n",
    "    time_limit=120, # seconds\n",
    ")\n",
    "\n",
    "clear_output()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cffaf2c",
   "metadata": {},
   "source": [
    "Under the hood, AutoMM automatically infers the problem type (classification or regression), detects the data modalities, selects the related models from the multimodal model pools, and trains the selected models. If multiple backbones are available, AutoMM appends a late-fusion model (MLP or transformer) on top of them.\n",
    "\n",
    "## Prediction with default PyTorch module\n",
    "Given a multimodal dataframe without the label column, we can predict the labels.\n",
    "\n",
    "Note that we would use a small sample of test data here for benchmarking. Later, we would evaluate over the whole test dataset to assess accuracy loss."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27542849",
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_size = 2\n",
    "n_trails = 10\n",
    "sample = test_data.head(batch_size)\n",
    "\n",
    "# Use first prediction for initialization (e.g., allocating memory)\n",
    "y_pred = predictor.predict_proba(sample)\n",
    "\n",
    "pred_time = []\n",
    "for _ in range(n_trails):\n",
    "    tic = time.time()\n",
    "    y_pred = predictor.predict_proba(sample)\n",
    "    elapsed = time.time()-tic\n",
    "    pred_time.append(elapsed)\n",
    "    print(f\"elapsed (pytorch): {elapsed*1000:.1f} ms (batch_size={batch_size})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adbff0ee",
   "metadata": {},
   "source": [
    "## Prediction with TensorRT module\n",
    "\n",
    "First, let's load a new predictor that optimize it for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9eb56807",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_path = predictor.path\n",
    "trt_predictor = MultiModalPredictor.load(path=model_path)\n",
    "trt_predictor.optimize_for_inference()\n",
    "\n",
    "# Again, use first prediction for initialization (e.g., allocating memory)\n",
    "y_pred_trt = trt_predictor.predict_proba(sample)\n",
    "\n",
    "clear_output()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3d2afb7",
   "metadata": {},
   "source": [
    "Under the hood, the `optimize_for_inference()` would generate an onnxruntime-based module that can be a drop-in replacement of torch.nn.Module. It would replace the internal torch-based module `predictor._model` for optimized inference.\n",
    "\n",
    "```{warning}\n",
    "The function `optimize_for_inference()` would modify internal model definition for inference only. Calling `predictor.fit()` after this would result in an error.\n",
    "It is recommended to reload the model with `MultiModalPredictor.load`, in order to refit the model.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24c14b17",
   "metadata": {},
   "source": [
    "Then, we can perform prediction or extract embeddings as usual. For fair inference speed comparison, here we run prediction multiple times."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "809fb6fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "pred_time_trt = []\n",
    "for _ in range(n_trails):\n",
    "    tic = time.time()\n",
    "    y_pred_trt = trt_predictor.predict_proba(sample)\n",
    "    elapsed = time.time()-tic\n",
    "    pred_time_trt.append(elapsed)\n",
    "    print(f\"elapsed (tensorrt): {elapsed*1000:.1f} ms (batch_size={batch_size})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f94d146",
   "metadata": {},
   "source": [
    "To verify the correctness of the prediction results, we can compare the results side-by-side.\n",
    "\n",
    "Let's take a peek at the expected results and TensorRT results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f8e5022",
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred, y_pred_trt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08f29374",
   "metadata": {},
   "source": [
    "As we are using mixed precision (FP16) by default, there might be loss of accuracy. We can see the probabilities are quite close, and we should be able to safely assume these results are relatively close for most of the cases. Refer to [Reduced Precision section in TensorRT Developer Guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#reduced-precision) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f447fc8",
   "metadata": {},
   "outputs": [],
   "source": [
    "np.testing.assert_allclose(y_pred, y_pred_trt, atol=0.01)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a76e541",
   "metadata": {},
   "source": [
    "### Visualize Inference Speed\n",
    "\n",
    "We can calculate inference time by dividing the prediction time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ea1e0e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "infer_speed = batch_size/np.mean(pred_time)\n",
    "infer_speed_trt = batch_size/np.mean(pred_time_trt)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb5d4255",
   "metadata": {},
   "source": [
    "Then, visualize speed improvements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9261e156",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "fig, ax = plt.subplots()\n",
    "fig.set_figheight(1.5)\n",
    "ax.barh([\"PyTorch\", \"TensorRT\"], [infer_speed, infer_speed_trt])\n",
    "ax.annotate(f\"{infer_speed:.1f} rows/s\", xy=(infer_speed, 0))\n",
    "ax.annotate(f\"{infer_speed_trt:.1f} rows/s\", xy=(infer_speed_trt, 1))\n",
    "_ = plt.xlabel('Inference Speed (rows per second)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "952bb633",
   "metadata": {},
   "source": [
    "### Compare Evaluation Metric\n",
    "Now that we can achieve better inference speed with `optimize_for_inference()`, but is there any impact to the underlining accuracy loss?\n",
    "\n",
    "Let's start with whole test dataset evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb8ccfee",
   "metadata": {},
   "outputs": [],
   "source": [
    "metric = predictor.evaluate(test_data)\n",
    "metric_trt = trt_predictor.evaluate(test_data)\n",
    "clear_output()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7633720a",
   "metadata": {},
   "outputs": [],
   "source": [
    "metric_df = pd.DataFrame.from_dict({\"PyTorch\": metric, \"TensorRT\": metric_trt})\n",
    "metric_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae972756",
   "metadata": {},
   "source": [
    "The evaluation results are expected to be very close.\n",
    "\n",
    "In case there is any significant gap between the evaluation results, try disabling mixed precision by using CUDA execution provider:\n",
    "\n",
    "```python\n",
    "predictor.optimize_for_inference(providers=[\"CUDAExecutionProvider\"])\n",
    "```\n",
    "\n",
    "See [Execution Providers](https://onnxruntime.ai/docs/execution-providers/) for a full list of providers.\n",
    "\n",
    "## Other Examples\n",
    "\n",
    "You may go to [AutoMM Examples](https://github.com/autogluon/autogluon/tree/master/examples/automm) to explore other examples about AutoMM.\n",
    "\n",
    "## Customization\n",
    "To learn how to customize AutoMM, please refer to [Customize AutoMM](customization.ipynb)."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
